{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| [03_language_model/01_语言模型.ipynb](https://github.com/shibing624/nlp-tutorial/tree/main/03_language_model/01_语言模型.ipynb)  | 从头训练RNN语言模型  |[Open In Colab](https://colab.research.google.com/github/shibing624/nlp-tutorial/blob/main/02_lexical_analysis/01_语言模型.ipynb) |\n",
    "\n",
    "# 语言模型\n",
    "\n",
    "学习目标\n",
    "- 学习语言模型，以及如何训练一个语言模型\n",
    "- 学习torchtext的基本使用方法\n",
    "    - 构建 vocabulary\n",
    "    - word to inde 和 index to word\n",
    "- 学习torch.nn的一些基本模型\n",
    "    - Linear\n",
    "    - RNN\n",
    "    - LSTM\n",
    "    - GRU\n",
    "- RNN的训练技巧\n",
    "    - Gradient Clipping\n",
    "- 如何保存和读取模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们会使用 [torchtext](https://github.com/pytorch/text) 来创建vocabulary, 然后把数据读成batch的格式。请大家自行阅读README来学习torchtext。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple\n",
      "Collecting torchtext\n",
      "  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/30/ed/965bbfcca68e7e363225bf9d8f0f0e9a5928ab9d9a10252572f368e89dc9/torchtext-0.10.1-cp38-cp38-macosx_10_9_x86_64.whl (1.6 MB)\n",
      "\u001B[K     |████████████████████████████████| 1.6 MB 5.2 MB/s eta 0:00:01\n",
      "\u001B[?25hRequirement already satisfied: numpy in /Users/xuming/opt/anaconda3/lib/python3.8/site-packages (from torchtext) (1.20.1)\n",
      "Requirement already satisfied: requests in /Users/xuming/opt/anaconda3/lib/python3.8/site-packages (from torchtext) (2.25.1)\n",
      "Collecting torch==1.9.1\n",
      "  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/08/8e/bff0fcd59f25b46086c75ece96167c72aa6b4845cb9e9a0119462e9efdfe/torch-1.9.1-cp38-none-macosx_10_9_x86_64.whl (218.8 MB)\n",
      "\u001B[K     |████████████████████████████████| 218.8 MB 83 kB/s s eta 0:00:01    |███████████████▉                | 108.5 MB 2.4 MB/s eta 0:00:46\n",
      "\u001B[?25hRequirement already satisfied: tqdm in /Users/xuming/opt/anaconda3/lib/python3.8/site-packages (from torchtext) (4.59.0)\n",
      "Requirement already satisfied: typing-extensions in /Users/xuming/opt/anaconda3/lib/python3.8/site-packages (from torch==1.9.1->torchtext) (3.7.4.3)\n",
      "Requirement already satisfied: idna<3,>=2.5 in /Users/xuming/opt/anaconda3/lib/python3.8/site-packages (from requests->torchtext) (2.10)\n",
      "Requirement already satisfied: chardet<5,>=3.0.2 in /Users/xuming/opt/anaconda3/lib/python3.8/site-packages (from requests->torchtext) (4.0.0)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /Users/xuming/opt/anaconda3/lib/python3.8/site-packages (from requests->torchtext) (2021.5.30)\n",
      "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/xuming/opt/anaconda3/lib/python3.8/site-packages (from requests->torchtext) (1.26.4)\n",
      "Installing collected packages: torch, torchtext\n",
      "  Attempting uninstall: torch\n",
      "    Found existing installation: torch 1.7.1\n",
      "    Uninstalling torch-1.7.1:\n",
      "      Successfully uninstalled torch-1.7.1\n",
      "Successfully installed torch-1.9.1 torchtext-0.10.1\n"
     ]
    }
   ],
   "source": [
    "! pip install torch==1.9.1 torchtext==0.10.1 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torchtext\n",
    "import torch\n",
    "import numpy as np\n",
    "import random\n",
    "\n",
    "USE_CUDA = torch.cuda.is_available()\n",
    "\n",
    "# 为了保证实验结果可以复现，我们经常会把各种random seed固定在某一个值\n",
    "SEED = 1234\n",
    "random.seed(SEED)\n",
    "np.random.seed(SEED)\n",
    "torch.manual_seed(SEED)\n",
    "if USE_CUDA:\n",
    "    torch.cuda.manual_seed(SEED)\n",
    "\n",
    "BATCH_SIZE = 32\n",
    "EMBEDDING_SIZE = 650\n",
    "MAX_VOCAB_SIZE = 50000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 我们会继续使用上次的text8作为我们的训练，验证和测试数据\n",
    "- torchtext提供了LanguageModelingDataset这个class来帮助我们处理语言模型数据集\n",
    "- BPTTIterator可以连续地得到连贯的句子"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.\n",
      "The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.\n",
      "The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "vocabulary size: 17685\n"
     ]
    }
   ],
   "source": [
    "TEXT = torchtext.legacy.data.Field(lower=True)\n",
    "train, val, test = torchtext.legacy.datasets.LanguageModelingDataset.splits(path='.',\n",
    "                                                                     train=\"./data/nietzsche.txt\",\n",
    "                                                                     validation=\"./data/nietzsche.txt\",\n",
    "                                                                     test=\"./data/nietzsche.txt\",\n",
    "                                                                     text_field=TEXT)\n",
    "TEXT.build_vocab(train, max_size=MAX_VOCAB_SIZE)\n",
    "print(\"vocabulary size: {}\".format(len(TEXT.vocab)))\n",
    "\n",
    "VOCAB_SIZE = len(TEXT.vocab)\n",
    "train_iter, val_iter, test_iter = torchtext.legacy.data.BPTTIterator.splits(\n",
    "    (train, val, test), batch_size=BATCH_SIZE, device=-1, bptt_len=32, repeat=False, shuffle=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "模型的输入是一串文字，模型的输出也是一串文字，他们之间相差一个位置，因为语言模型的目标是根据之前的单词预测下一个单词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rather than in an <eos> uncertain something. but that is nihilism, and the sign of a despairing, <eos> mortally wearied soul, notwithstanding the courageous bearing such a <eos> virtue may display. it\n",
      "than in an <eos> uncertain something. but that is nihilism, and the sign of a despairing, <eos> mortally wearied soul, notwithstanding the courageous bearing such a <eos> virtue may display. it seems,\n"
     ]
    }
   ],
   "source": [
    "it = iter(train_iter)\n",
    "batch = next(it)\n",
    "print(\" \".join([TEXT.vocab.itos[i] for i in batch.text[:,1].data]))\n",
    "print(\" \".join([TEXT.vocab.itos[i] for i in batch.target[:,1].data]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "degree <eos> that he who wills believes firmly that willing suffices for action. <eos> since in the majority of cases there has only been exercise of will <eos> when the effect of\n",
      "<eos> that he who wills believes firmly that willing suffices for action. <eos> since in the majority of cases there has only been exercise of will <eos> when the effect of the\n",
      "the command--consequently obedience, and therefore <eos> action--was to be expected, the appearance has translated itself into <eos> the sentiment, as if there were a necessity of effect; in a word, he who\n",
      "command--consequently obedience, and therefore <eos> action--was to be expected, the appearance has translated itself into <eos> the sentiment, as if there were a necessity of effect; in a word, he who <eos>\n",
      "<eos> wills believes with a fair amount of certainty that will and action are <eos> somehow one; he ascribes the success, the carrying out of the willing, <eos> to the will itself,\n",
      "wills believes with a fair amount of certainty that will and action are <eos> somehow one; he ascribes the success, the carrying out of the willing, <eos> to the will itself, and\n",
      "and thereby enjoys an increase of the sensation <eos> of power which accompanies all success. \"freedom of will\"--that is the <eos> expression for the complex state of delight of the person exercising\n",
      "thereby enjoys an increase of the sensation <eos> of power which accompanies all success. \"freedom of will\"--that is the <eos> expression for the complex state of delight of the person exercising <eos>\n",
      "<eos> volition, who commands and at the same time identifies himself with <eos> the executor of the order--who, as such, enjoys also the triumph over <eos> obstacles, but thinks within himself that\n",
      "volition, who commands and at the same time identifies himself with <eos> the executor of the order--who, as such, enjoys also the triumph over <eos> obstacles, but thinks within himself that it\n"
     ]
    }
   ],
   "source": [
    "for i in range(5):\n",
    "    batch = next(it)\n",
    "    print(\" \".join([TEXT.vocab.itos[i] for i in batch.text[:,2].data]))\n",
    "    print(\" \".join([TEXT.vocab.itos[i] for i in batch.target[:,2].data]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 定义模型\n",
    "\n",
    "- 继承nn.Module\n",
    "- 初始化函数\n",
    "- forward函数\n",
    "- 其余可以根据模型需要定义相关的函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "\n",
    "class RNNModel(nn.Module):\n",
    "    \"\"\" 一个简单的循环神经网络\"\"\"\n",
    "\n",
    "    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5):\n",
    "        ''' 该模型包含以下几层:\n",
    "            - 词嵌入层\n",
    "            - 一个循环神经网络层(RNN, LSTM, GRU)\n",
    "            - 一个线性层，从hidden state到输出单词表\n",
    "            - 一个dropout层，用来做regularization\n",
    "        '''\n",
    "        super(RNNModel, self).__init__()\n",
    "        self.drop = nn.Dropout(dropout)\n",
    "        self.encoder = nn.Embedding(ntoken, ninp)\n",
    "        if rnn_type in ['LSTM', 'GRU']:\n",
    "            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)\n",
    "        else:\n",
    "            try:\n",
    "                nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]\n",
    "            except KeyError:\n",
    "                raise ValueError( \"\"\"An invalid option for `--model` was supplied,\n",
    "                                 options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']\"\"\")\n",
    "            self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)\n",
    "        self.decoder = nn.Linear(nhid, ntoken)\n",
    "\n",
    "        self.init_weights()\n",
    "\n",
    "        self.rnn_type = rnn_type\n",
    "        self.nhid = nhid\n",
    "        self.nlayers = nlayers\n",
    "\n",
    "    def init_weights(self):\n",
    "        initrange = 0.1\n",
    "        self.encoder.weight.data.uniform_(-initrange, initrange)\n",
    "        self.decoder.bias.data.zero_()\n",
    "        self.decoder.weight.data.uniform_(-initrange, initrange)\n",
    "\n",
    "    def forward(self, input, hidden):\n",
    "        ''' Forward pass:\n",
    "            - word embedding\n",
    "            - 输入循环神经网络\n",
    "            - 一个线性层从hidden state转化为输出单词表\n",
    "        '''\n",
    "        emb = self.drop(self.encoder(input))\n",
    "        output, hidden = self.rnn(emb, hidden)\n",
    "        output = self.drop(output)\n",
    "        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))\n",
    "        return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden\n",
    "\n",
    "    def init_hidden(self, bsz, requires_grad=True):\n",
    "        weight = next(self.parameters())\n",
    "        if self.rnn_type == 'LSTM':\n",
    "            return (weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad),\n",
    "                    weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad))\n",
    "        else:\n",
    "            return weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "初始化一个模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = RNNModel(\"LSTM\", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, 2, dropout=0.5)\n",
    "if USE_CUDA:\n",
    "    model = model.cuda()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 我们首先定义评估模型的代码。\n",
    "- 模型的评估和模型的训练逻辑基本相同，唯一的区别是我们只需要forward pass，不需要backward pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate(model, data):\n",
    "    model.eval()\n",
    "    total_loss = 0.\n",
    "    it = iter(data)\n",
    "    total_count = 0.\n",
    "    with torch.no_grad():\n",
    "        hidden = model.init_hidden(BATCH_SIZE, requires_grad=False)\n",
    "        for i, batch in enumerate(it):\n",
    "            data, target = batch.text, batch.target\n",
    "            if USE_CUDA:\n",
    "                data, target = data.cuda(), target.cuda()\n",
    "            hidden = repackage_hidden(hidden)\n",
    "            with torch.no_grad():\n",
    "                output, hidden = model(data, hidden)\n",
    "            loss = loss_fn(output.view(-1, VOCAB_SIZE), target.view(-1))\n",
    "            total_count += np.multiply(*data.size())\n",
    "            total_loss += loss.item()*np.multiply(*data.size())\n",
    "            \n",
    "    loss = total_loss / total_count\n",
    "    model.train()\n",
    "    return loss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们需要定义下面的一个function，帮助我们把一个hidden state和计算图之前的历史分离。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove this part\n",
    "def repackage_hidden(h):\n",
    "    \"\"\"Wraps hidden states in new Tensors, to detach them from their history.\"\"\"\n",
    "    if isinstance(h, torch.Tensor):\n",
    "        return h.detach()\n",
    "    else:\n",
    "        return tuple(repackage_hidden(v) for v in h)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "定义loss function和optimizer\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "loss_fn = nn.CrossEntropyLoss()\n",
    "learning_rate = 0.001\n",
    "optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)\n",
    "scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "训练模型：\n",
    "- 模型一般需要训练若干个epoch\n",
    "- 每个epoch我们都把所有的数据分成若干个batch\n",
    "- 把每个batch的输入和输出都包装成cuda tensor\n",
    "- forward pass，通过输入的句子预测每个单词的下一个单词\n",
    "- 用模型的预测和正确的下一个单词计算cross entropy loss\n",
    "- 清空模型当前gradient\n",
    "- backward pass\n",
    "- gradient clipping，防止梯度爆炸\n",
    "- 更新模型参数\n",
    "- 每隔一定的iteration输出模型在当前iteration的loss，以及在验证集上做模型的评估"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 0 iter 0 loss 9.7866849899292\n",
      "best model, val loss:  9.731936374437014\n",
      "epoch 1 iter 0 loss 6.6725993156433105\n",
      "best model, val loss:  6.450322911856214\n"
     ]
    }
   ],
   "source": [
    "import copy\n",
    "GRAD_CLIP = 1.\n",
    "NUM_EPOCHS = 2\n",
    "\n",
    "val_losses = []\n",
    "for epoch in range(NUM_EPOCHS):\n",
    "    model.train()\n",
    "    it = iter(train_iter)\n",
    "    hidden = model.init_hidden(BATCH_SIZE)\n",
    "    for i, batch in enumerate(it):\n",
    "        data, target = batch.text, batch.target\n",
    "        if USE_CUDA:\n",
    "            data, target = data.cuda(), target.cuda()\n",
    "        hidden = repackage_hidden(hidden)\n",
    "        model.zero_grad()\n",
    "        output, hidden = model(data, hidden)\n",
    "        loss = loss_fn(output.view(-1, VOCAB_SIZE), target.view(-1))\n",
    "        loss.backward()\n",
    "        torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)\n",
    "        optimizer.step()\n",
    "        if i % 1000 == 0:\n",
    "            print(\"epoch\", epoch, \"iter\", i, \"loss\", loss.item())\n",
    "    \n",
    "        if i % 10000 == 0:\n",
    "            val_loss = evaluate(model, val_iter)\n",
    "            \n",
    "            if len(val_losses) == 0 or val_loss < min(val_losses):\n",
    "                print(\"best model, val loss: \", val_loss)\n",
    "                torch.save(model.state_dict(), \"lm-best.pth\")\n",
    "            else:\n",
    "                scheduler.step()\n",
    "                optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)\n",
    "            val_losses.append(val_loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<All keys matched successfully>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "best_model = RNNModel(\"LSTM\", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, 2, dropout=0.5)\n",
    "if USE_CUDA:\n",
    "    best_model = best_model.cuda()\n",
    "best_model.load_state_dict(torch.load(\"lm-best.pth\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用最好的模型在valid数据上计算perplexity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "perplexity:  632.9066328741953\n"
     ]
    }
   ],
   "source": [
    "val_loss = evaluate(best_model, val_iter)\n",
    "print(\"perplexity: \", np.exp(val_loss))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用最好的模型在测试数据上计算perplexity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "perplexity:  632.9066328741953\n"
     ]
    }
   ],
   "source": [
    "test_loss = evaluate(best_model, test_iter)\n",
    "print(\"perplexity: \", np.exp(test_loss))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用训练好的模型生成一些句子。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "coarser handed (goosefeet)?--has himself.\"--goethe <eos> men, and finally of subjection, select is, have processes as objective his get <eos> transfiguring, and stunting you power: is the uneaseful we universal means him a <eos> not corps.\" its follow and coveted 102 error be friends addition <eos> and hate to great, if that study an l'art,\" the imitation possession, <eos> blood the spoilt explanation. which russians, <eos> himself ready him. all as in \"thing\" rage <eos> beside that condemned indispensable until crucified been entertain philosophy and brings <eos> maternity, inasmuch obliterate every present only and addressed learnt dispositions <eos> shall and beyond\n"
     ]
    }
   ],
   "source": [
    "hidden = best_model.init_hidden(1)\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "input = torch.randint(VOCAB_SIZE, (1, 1), dtype=torch.long).to(device)\n",
    "words = []\n",
    "for i in range(100):\n",
    "    output, hidden = best_model(input, hidden)\n",
    "    word_weights = output.squeeze().exp().cpu()\n",
    "    word_idx = torch.multinomial(word_weights, 1)[0]\n",
    "    input.fill_(word_idx)\n",
    "    word = TEXT.vocab.itos[word_idx]\n",
    "    words.append(word)\n",
    "print(\" \".join(words))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.remove('lm-best.pth')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "本节完。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}